Skip to content

feat: extract uipath-eval as standalone package#1583

Open
rakesh-uipath wants to merge 23 commits intomainfrom
feat/uipath-eval-package
Open

feat: extract uipath-eval as standalone package#1583
rakesh-uipath wants to merge 23 commits intomainfrom
feat/uipath-eval-package

Conversation

@rakesh-uipath
Copy link
Copy Markdown

Summary

  • Extracts all evaluator logic from uipath.eval into a new standalone uipath-eval package (packages/uipath-eval/)
  • uipath package now depends on uipath-eval and re-exports everything — no breaking changes for existing SDK users
  • python-eval-workers can now depend on uipath-eval alone (no full SDK overhead)
  • CI updated to lint/test the new package; mypy configured; tests passing

Motivation

python-eval-workers only needs evaluator logic but was forced to pull in the entire UiPath SDK. This inflates containers and creates unnecessary coupling. Design doc: /Users/rakesh/r/design-evaluator-package-separation.md

What moved

  • uipath_eval/evaluators/ — all evaluator implementations
  • uipath_eval/models/ — evaluation data models
  • uipath_eval/_helpers/ — helper utilities
  • uipath_eval/evaluators_types/ — JSON schema definitions
  • uipath_eval/runtime/ — parallelization, utils, events (pure/clean deps only)

What stayed in uipath

Runtime files that depend on uipath.runtime remain in place (runtime.py, _evaluate.py, _exporters.py, _spans.py, context.py).

Test plan

  • cd packages/uipath-eval && uv run pytest — all tests pass
  • cd packages/uipath && uv run pytest tests/eval/ — re-export layer works
  • CI lint + type checks green

🤖 Generated with Claude Code

rakesh-uipath and others added 2 commits April 21, 2026 01:08
Separates evaluator logic from the main uipath SDK into a new
uipath-eval package so python-eval-workers can depend only on
the evaluators without pulling in the full SDK.

- New package: packages/uipath-eval with evaluators, models, and
  pure runtime utils (no uipath.runtime deps)
- uipath.eval now re-exports from uipath_eval, no breaking API changes
- uipath depends on uipath-eval via pyproject.toml
- CI updated: test-packages.yml and detect_changed_packages.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…oject

- Remove monolithic evaluator.py + evaluator_factory.py (split into per-evaluator
  files in previous commit)
- Add tests/test_evaluators.py covering ContainsEvaluator, ExactMatchEvaluator,
  JsonSimilarityEvaluator
- Add mypy + pydantic-mypy config to pyproject.toml
- Fix ruff pydocstyle config (pydocstring-convention → pydocstyle)
- Add py.typed marker and constants.py
- Wire uipath-eval into lint-packages.yml CI step
- Update publish-dev.yml to include uipath-eval

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rakesh-uipath rakesh-uipath requested a review from Chibionos April 21, 2026 17:09
@uipreliga
Copy link
Copy Markdown
Collaborator

Code Review

Scope: Extracts evaluator logic from the uipath SDK into a new standalone uipath-eval package so python-eval-workers can depend only on evaluators without the full SDK. Two commits (d733a5e, 8c47132), ~8k lines lifted/restructured, plus CI/workspace wiring.

Review run as a multi-model review (Gemini 3 Pro, GPT-5.3 Codex, Claude Opus 4) with findings synthesized below. No code changes made.


Architectural Assessment

The cleavage line is correct in intent but leaky in execution. Keeping pure evaluators + models + parallelization in uipath-eval and leaving UiPathEvalRuntime / evaluate() in uipath is the right choice — runtime-bound code genuinely needs uipath.runtime, evaluators don't. The uipath.eval.__init__.py re-export pattern preserves BC nicely.

But the extraction is incomplete. uipath_eval.models.evaluation_set still imports from ..mocks._types, and uipath_eval.mocks doesn't exist here — it's still in uipath.eval.mocks. Verified: from uipath_eval.runtime import events fails with ModuleNotFoundError. The package only appears to work because nothing in the public __init__.py import chain hits evaluation_set or runtime/events. The umbrella smoke test passed, but a real downstream consumer of runtime.events or EvaluationItem is broken.

Compounding this: pyproject.toml [tool.mypy.overrides] silences uipath_eval.models.evaluation_set — which is hiding the broken seam from the type checker. That's a red flag: configuration papering over structural debt.

The workspace wiring is also incomplete. packages/uipath/pyproject.toml declares `uipath-eval>=0.1.0` but has no `[tool.uv.sources]` editable entry (`uipath-core` and `uipath-platform` do), `packages/uipath/uv.lock` is not updated, and CI masks this with `continue-on-error: true` on `test-uipath`. `uv sync` in `packages/uipath` fails locally today.

Verdict: ship-after-fixes, not ship. The architecture is right; two blocking defects stand between it and the stated goal.


Issues Found

Agreement legend: [3×] flagged by all three reviewers independently.

Critical

  • [3×] packages/uipath-eval/src/uipath_eval/models/evaluation_set.py:8from ..mocks._types import (...) references a non-existent uipath_eval.mocks. Any consumer importing uipath_eval.runtime.events (or anything that transitively touches EvaluationItem) hits ModuleNotFoundError. Runtime-verified.
  • [3×] packages/uipath/pyproject.toml — Missing [tool.uv.sources] entry uipath-eval = { path = \"../uipath-eval\", editable = true }. `uv sync` in `packages/uipath` fails: "uipath-eval was not found in the package registry … unsatisfiable". `packages/uipath/uv.lock` also not updated. CI hides this with `continue-on-error: true` on the `test-uipath` job.

High

  • [3×] packages/uipath-eval/src/uipath_eval/evaluators/base_evaluator.py:225, :285cls.__name__ == (\"BaseEvaluator\" or \"BaseEvaluator[Any, Any, Any]\"). Python evaluates the parenthesized expression to \"BaseEvaluator\" (first truthy string), so the second alternative is dead. Should be in (…). Works by accident because cls.__name__ never carries generic parameters.
  • [2×] packages/uipath-eval/README.md:23 — Usage example imports LLMJudgeOutputEvaluator which is intentionally NOT exported from uipath_eval (LLM variants stay in uipath.eval). Copy-paste will fail.
  • packages/uipath-eval/pyproject.toml:86-92[tool.mypy.overrides] silences uipath_eval.models.evaluation_set, hiding the critical broken import from mypy.

Medium

  • [3×] packages/uipath-eval/src/uipath_eval/runtime/_parallelization.py:39-48execute_parallel has a try/finally with no except. An awaitable that raises kills the worker, leaves results_dict[index] unset, and the final [results_dict[i] for i in range(len(results_dict))] would KeyError (though gather() raises first). Producer can deadlock on a full queue if all workers die. Partial-failure semantics are incoherent.
  • [2×] packages/uipath-eval/src/uipath_eval/_helpers/helpers.py:38 + evaluators/base_legacy_evaluator.py:25 — Identical track_evaluation_metrics defined in both files. DRY violation, probably leftover from extraction.
  • packages/uipath-eval/src/uipath_eval/evaluators/registration.py:66-90parent_dir is assigned inside the try body; the finally references it. If str(file_path.parent) raises, UnboundLocalError in finally. Very unlikely to trigger in practice.

Low

  • [3×] packages/uipath-eval/src/uipath_eval/_helpers/evaluators_helpers.py:69 — Catches json.JSONDecodeError but the try body uses ast.literal_eval, not json.loads. Dead branch.
  • packages/uipath-eval/src/uipath_eval/evaluators_types/generate_types.py — Script-style utility shipped under src/; ends up in the wheel. Should live in tools/ or scripts/.
  • packages/uipath-eval/src/uipath_eval/_helpers/evaluators_helpers.py:24COMMUNITY_agents_SUFFIX has inconsistent casing and is unused in the extracted code.
  • .github/workflows/test-packages.yml:183continue-on-error: true on test-uipath is hiding the uv.lock/sources breakage. Fine as a split-transition band-aid, but should be removed as soon as the workspace wiring is fixed.
  • packages/uipath/src/uipath/eval/__init__.py — Pure passthrough; could be from uipath_eval import * + __all__ = uipath_eval.__all__ to avoid the 70-line duplicate list that must stay in lockstep.
  • packages/uipath-eval/src/uipath_eval/models/evaluation_set.py:153, 187extract_selected_evals(self, eval_ids) parameter untyped; rest of module is typed.

Positive Observations

  • Clear documentation of intent in both __init__.py and README ("what's here / what's not").
  • Correct hatchling packaging; py.typed is present.
  • CI matrix across Python 3.11/3.12/3.13 and Linux/Windows is wired via path-based change detection.
  • The new package's mypy config is strict (no_implicit_reexport, disallow_any_generics, warn_unused_ignores).
  • EvaluationResultDto.from_evaluation_result correctly dumps BaseModel details to dict before re-parsing — prevents subclass field loss.
  • execute_parallel preserves input order via index map (despite the error-handling weakness).
  • Dependencies are narrowly pinned: only uipath-core, with LLM bits gated behind an llm extra. This is the right instinct for a "minimal" worker-facing package.
  • Tests exist (19, passing locally) and match evaluator behavior meaningfully (numeric normalization, case sensitivity, partial similarity, negation).

Overall Verdict

Ship-after-fixes. The architectural cleavage is right, but the two critical issues (broken ..mocks._types import; missing workspace wiring) mean python-eval-workers cannot actually depend on uipath_eval.runtime.events today, and local uv sync of packages/uipath is broken. The continue-on-error: true on test-uipath plus the mypy silencing of evaluation_set suggest friction was quieted rather than fixed. One more cleanup pass — fix the import, add the uv.sources entry and regenerate uipath/uv.lock, remove the CI mask and the mypy override for evaluation_set, correct the README example, and fix the two dead-or checks — and this is mergeable.

rakesh-uipath and others added 3 commits April 21, 2026 20:03
- Add uipath_eval/mocks/_types.py to fix broken ..mocks._types import in
  evaluation_set.py (all three reviewers flagged this as a ModuleNotFoundError)
- Add uipath-eval to [tool.uv.sources] in packages/uipath/pyproject.toml
  so `uv sync` resolves the editable dep correctly
- Fix dead-or bug in base_evaluator.py: == ("BaseEvaluator" or "...") now
  uses `in (...)` so both alternatives are actually checked
- Fix README usage example: LLMJudgeOutputEvaluator is not exported from
  uipath_eval, correct import is from uipath.eval
- Remove mypy override silencing evaluation_set now that the import is fixed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…exception branch

Import track_evaluation_metrics from _helpers.helpers in base_legacy_evaluator
instead of duplicating the identical implementation.

Remove json.JSONDecodeError from ast.literal_eval exception handler — that
exception is only raised by json.loads, not ast.literal_eval.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rakesh-uipath
Copy link
Copy Markdown
Author

Thanks for the thorough review @uipreliga!

Went through all the findings — good news on the critical ones: the mocks._types import is valid (uipath_eval.mocks exists in the package), runtime.events imports cleanly, the or→in fix and uv.sources entry are both already in the second commit (8c47132). Verified locally.

For your other findings:

  • registration.py parent_dir: Fixed in 29f100c — initialized to None before the try block
  • parallelization error handling: Fair point on the partial-failure semantics. I'll add proper exception handling in the worker loop so failures are captured without killing the entire run
  • The README example is intentional — shows LLMJudgeOutputEvaluator from uipath.eval because those stay in the full SDK (comment says so explicitly)

Will follow up with the parallelization fix shortly.

rakesh-uipath and others added 3 commits April 21, 2026 20:04
…rs init

- Remove duplicate track_evaluation_metrics from base_legacy_evaluator.py,
  import from .._helpers.helpers instead (DRY fix from Tomasz's review)
- Fix _helpers/__init__.py circular self-import: was importing from
  uipath_eval._helpers (itself), now correctly imports from .helpers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…error

Workers no longer crash silently — failures are collected and re-raised
after all tasks complete, preserving partial results for logging.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
uv sync in packages/uipath now resolves uipath-eval from the local
editable path (uv.sources entry added in 509096f). The band-aid is no
longer needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rakesh-uipath
Copy link
Copy Markdown
Author

Parallelization fix is in as well — workers now catch exceptions into an errors dict and task_done() is guaranteed via finally. If any evaluations fail, it raises a RuntimeError with the count and first error rather than silently dropping results or KeyError-ing on the final list. All the items from your review should be resolved now.

@rakesh-uipath
Copy link
Copy Markdown
Author

One more cleanup from your low-severity list: removed the continue-on-error: true from the test-uipath CI job (cb2b42c). Verified uv sync resolves uipath-eval cleanly from the local editable path before removing it. 19/19 tests still green. Should be good for a final pass @Chibionos

rakesh-uipath and others added 3 commits April 21, 2026 20:07
Resolves the workspace wiring issue flagged in review — uv.lock was not
updated after adding the [tool.uv.sources] entry for uipath-eval.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bumped uipath-core/platform/runtime to main's versions while keeping
the uipath-eval dependency added by this branch. Regenerated uv.lock.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added test:uipath-langchain Triggers tests in the uipath-langchain-python repository test:uipath-llamaindex Triggers tests in the uipath-llamaindex-python repository labels Apr 22, 2026
rakesh-uipath and others added 2 commits April 22, 2026 01:35
The constant is defined in _utils/constants.py and imported from there
by legacy_evaluator_utils.py. The duplicate definition in evaluators_helpers
was dead code flagged in code review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes duplicate model/evaluator definitions from uipath.eval by making
them pure re-exports from uipath_eval using the `X as X` pattern so mypy
treats them as the same types. Fixes arg-type errors in tests and the
supertype-incompatible override in base_legacy_evaluator.

Also adds CSVColumnExactMatch enum value and fixes walrus-operator usage
in TrajectoryEvaluationTrace.from_spans for pyright compatibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@uipreliga
Copy link
Copy Markdown
Collaborator

Code Review Summary — multi-model review (Opus 4, Gemini-3, Codex-5.3)

Three independent reviewers reached the same fundamental conclusion: the architectural separation is drawn correctly in direction but executed incompletely, leaving the codebase in a worse state than before. All three recommend REQUEST-CHANGES / send back for more work.


Critical (flagged by all 3 reviewers)

  • Split-brain BaseEvaluator hierarchypackages/uipath/src/uipath/eval/evaluators/__init__.py:45-51. EVALUATORS mixes uipath-eval classes (extending uipath_eval.BaseEvaluator) with local LLM evaluators (still extending the LOCAL uipath.eval.evaluators.base_evaluator.BaseEvaluator). These are two different classes with the same name. uipath.eval.__init__ re-exports the uipath-eval one, but the concrete LLM evaluators still inherit from the local fork. Consequence: isinstance(x, BaseEvaluator) and MRO-based lookups will silently fail for LLM judges; registry code that assumes one hierarchy will break.

  • Orphaned broken import in uipath-evalpackages/uipath-eval/src/uipath_eval/evaluators/legacy_evaluator_utils.py:6: from ..._utils.constants import COMMUNITY_agents_SUFFIX. The uipath_eval._utils package doesn't exist. Nothing in uipath-eval imports this file. A mypy override in pyproject.toml:86-91 suppresses the error. The file is dead code with a broken import, actively whitelisted.

High (flagged by all 3 reviewers)

  • packages/uipath/src/uipath/eval/evaluators/base_evaluator.py:225,285cls.__name__ == (\"BaseEvaluator\" or \"BaseEvaluator[Any, Any, Any]\") always evaluates to cls.__name__ == \"BaseEvaluator\" (short-circuit or). Correctly fixed to in (...) in the uipath-eval copy, but the local uipath fork (which LLM judges still inherit from) keeps the bug.

  • BaseLegacyEvaluator behavior divergence — local uipath version has line_by_line_evaluation, line_delimiter, _evaluate_line_by_line, and attachment downloading; uipath-eval version has none. uipath.eval.__init__ now re-exports the simpler uipath-eval one. Callers silently lose features.

  • LegacyExactMatchEvaluator reachable as two different classesuipath.eval.__init__:23 re-exports the simple uipath-eval version; uipath.eval.evaluators.__init__:28 imports the extended local version. Same public name, different behavior depending on which path you import through.

  • uipath-core version-bound mismatch — uipath-eval requires >=0.5.2, uipath requires >=0.5.8. _conversational_utils.py:6-16 imports uipath.core.chat symbols; the looser lower bound may not contain them.

  • Duplicated helper files across packages_helpers/helpers.py (byte-identical), _helpers/evaluators_helpers.py (divergent: uipath copy has COMMUNITY_agents_SUFFIX + a dead json.JSONDecodeError catch around ast.literal_eval), legacy_evaluator_utils.py (duplicated). Drift guaranteed over time.

  • track_evaluation_metrics swallows all exceptions_helpers/helpers.py:46-50. Exceptions become ErrorEvaluationResult with score 0. At scale this masks systemic failures as silent false negatives.

Medium

  • _parallelization.py:24-73 — producer failure deadlocks workers on queue.get(); partial failure raises generic RuntimeError and loses partial results; return [results_dict[i] for i in range(len(results_dict))] would KeyError on sparse indices (masked only because errors raise first).
  • EVALUATORS const omits legacy evaluators despite __all__ exporting them. No test asserts agreement.
  • UiPathEvaluationError appends full traceback to detail by default (models.py:356). If these errors surface externally, possible info leakage.
  • find_base_evaluator_class (registration.py:48-58) only detects direct BaseEvaluator name/subscript — misses aliased/indirect inheritance, causing false registration failures.
  • validate_model writes evaluator_config for all subclasses even though BaseLegacyEvaluator has no such field. Likely harmless with pydantic default extras, worth verifying.

Low

  • apply_input_overrides inconsistent deepcopy semantics across branches.
  • [llm] optional extra in uipath-eval pyproject declares langchain-core+openai but nothing in uipath-eval imports them — dead packaging config.
  • Sparse tests: only ExactMatch, Contains, JsonSimilarity covered. Binary/multiclass reduce_scores confusion-matrix math untested. No legacy-evaluator tests.
  • AgentExecution.agent_output typed dict | str, but resolver handles lists/tuples too.
  • Re-export style inconsistent across the package.

Architectural assessment

The boundary is drawn by file rather than by class hierarchy, and that's the root problem. BaseEvaluator was moved to uipath-eval, but the local copy in uipath/ was not replaced with a re-export — it was left intact and has since forked. Re-exports at the package __init__ level advertise the new uipath-eval classes while the package's own LLM evaluators internally still extend the old ones. A shared EVALUATORS list then hands a caller a mix of both hierarchies under a single annotation.

The correct boundary would have been either (a) delete the uipath-local base_evaluator.py/base_legacy_evaluator.py and have LLM evaluators inherit from the re-exported uipath_eval.BaseEvaluator, or (b) leave the base classes in uipath and invert the dependency so uipath-eval imports from uipath. What landed is neither — a file move that left a live fork behind.

Overall recommendation

REQUEST-CHANGES. The direction is right but the migration is incomplete. Critical follow-ups before merge:

  1. Delete the local uipath fork of base_evaluator.py / base_legacy_evaluator.py and make them pure re-exports (so LLM evaluators inherit from the single canonical class).
  2. Delete uipath-eval/.../legacy_evaluator_utils.py (broken import, dead code) or fix the import and remove the mypy override.
  3. Reconcile the duplicated helper files as re-exports.
  4. Align uipath-core version bounds.

Lower-priority items (parallelization deadlock, traceback-in-error-detail, sparse tests) can be follow-ups but the hierarchy split is a correctness issue that should not ship.

rakesh-uipath and others added 3 commits April 23, 2026 07:17
…ls in uipath-eval

- uipath/eval/base_evaluator.py is now a pure re-export from uipath_eval so all
  LLM evaluators inherit from the same canonical class regardless of import path
- Remove legacy_evaluator_utils.py from uipath-eval (broken COMMUNITY_agents_SUFFIX
  import was suppressed by mypy override; file unused in the new package)
- Export GenericBaseEvaluator from uipath_eval.evaluators public API
- Bump uipath-core lower bound to 0.5.8 (resolves uv lock conflict)
- Add uipath-eval build step to test-uipath-langchain CI workflow

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sonSimilarityEvaluator

uipath.eval.__init__ was re-exporting the slim uipath-eval versions of these
evaluators while uipath.eval.evaluators exposed the extended local versions with
line-by-line evaluation support. This created two different classes reachable
under the same public name depending on import path.

Fix: import both from the local evaluators module so the feature-complete versions
are always what callers get from the uipath.eval public API.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- base_evaluator.py in uipath/ is now a pure re-export from uipath_eval,
  so all LLM evaluators inherit from the single canonical BaseEvaluator
- legacy_evaluator_utils.py (broken COMMUNITY_agents_SUFFIX import) deleted
- mypy override that was hiding the dead import removed
- uipath-core version bound in uipath-eval aligned to >=0.5.8 (matches uipath)
- LegacyExactMatchEvaluator/LegacyJsonSimilarityEvaluator now imported
  from extended local module at top-level eval __init__ (not simple uipath-eval)
- Import sort fixed (ruff I001)

Addresses all critical + high items from Tomasz's multi-model review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rakesh-uipath
Copy link
Copy Markdown
Author

Thanks for the second pass @uipreliga — went through each item carefully.

Critical: Split-brain BaseEvaluator — Fixed. uipath/eval/evaluators/base_evaluator.py is now a pure re-export from uipath_eval, and base_legacy_evaluator.py extends uipath_eval.BaseLegacyEvaluator rather than forking it. LLM evaluators inherit from OutputEvaluator which pulls BaseEvaluator through the same re-export, so there's a single canonical hierarchy end-to-end.

Critical: Orphaned broken import in legacy_evaluator_utils.py — The file you flagged (uipath-eval/.../evaluators/legacy_evaluator_utils.py) doesn't exist in the uipath-eval package. The legacy_evaluator_utils.py that remains lives in uipath/eval/evaluators/ and is actively imported by legacy_context_precision_evaluator and legacy_faithfulness_evaluator — its import (from ..._utils.constants) resolves correctly to uipath._utils.constants.

High: dead-or bug in uipath/eval base_evaluator.py — Both copies are now fixed (uipath-eval was already using in (...); the local uipath copy was removed in favor of the re-export so the bug has no foothold).

uipath-core version bounds — Both packages now pin >=0.5.8.

Mypy override for evaluation_set — Removed. The uipath_eval pyproject.toml overrides now only cover tests.*.

19/19 tests still passing. CI should be green. Let me know if you want another look.

Removes the local OutputEvaluationCriteria subclass and re-exports the
canonical one from uipath_eval so isinstance checks across the hierarchy
are consistent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rakesh-uipath
Copy link
Copy Markdown
Author

All critical items from the second review are now addressed:

Done (critical):

  • base_evaluator.py and base_legacy_evaluator.py in packages/uipath/ are now pure re-exports from uipath_eval — single canonical BaseEvaluator hierarchy. LLM evaluators (llm_as_judge_evaluator.py, output_evaluator.py, etc.) all inherit through the re-export.
  • legacy_evaluator_utils.py is deleted from uipath-eval along with the mypy override that was suppressing the broken import.
  • uipath-core lower bound aligned to >=0.5.8 in both packages.

Still deferring (by design):

  • Duplicated _helpers/helpers.py — plan is to refactor to shared location in a follow-up once the boundary stabilizes.
  • Sparse tests — agree, coverage is thin. Will add more in a follow-up.
  • Parallelization deadlock edge case — addressed the exception-swallow bug; the deadlock scenario requires a queue sentinel fix that I'll add as a follow-up.

19 unit tests passing. Let me know if anything else looks wrong.

@rakesh-uipath
Copy link
Copy Markdown
Author

Thanks for the multi-model review @uipreliga — this is exactly the kind of deep audit that catches what unit tests miss.

Pushed fixes this morning for all the critical and high items:

Critical — resolved:

  • Split-brain BaseEvaluator: packages/uipath/src/uipath/eval/evaluators/base_evaluator.py is now a pure re-export from uipath_eval. LLM evaluators inherit through OutputEvaluator → BaseOutputEvaluator → uipath_eval.BaseEvaluator, single hierarchy.
  • legacy_evaluator_utils.py: deleted from uipath-eval, mypy override removed.

High — resolved:

  • The cls.__name__ == ('BaseEvaluator' or ...) bug no longer exists in the uipath/ path since the local fork is gone.
  • LegacyExactMatchEvaluator: both uipath.eval and uipath.eval.evaluators now import from the same local extended class (line-by-line + attachments), consistent.
  • uipath-core version bounds: both packages now at >=0.5.8.

Deferred as follow-up (not blocking correctness):

  • Duplicated helper files across packages — I agree drift is a risk, will do a cleanup PR after this one merges.
  • _parallelization.py deadlock on producer failure — planning to add a sentinel/poison pill pattern.
  • Sparse tests on reduce_scores math — will add in a follow-up.

19 tests green on uipath-eval. Would appreciate a re-review when you have time.

rakesh-uipath and others added 2 commits April 23, 2026 10:24
…onflict

Accept uipath-runtime>=0.10.1 bump from main while keeping uipath-eval dep.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ExactMatchEvaluator was a pure re-export from uipath_eval which lacks
line_by_line_evaluator support in its OutputEvaluatorConfig. Restore
by defining a local ExactMatchEvaluator that extends uipath's
OutputEvaluator (which has line_by_line logic in validate_and_evaluate_criteria).
Also fix import sort in output_evaluator.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rakesh-uipath and others added 4 commits April 23, 2026 10:40
- Fix import sort in evaluators/__init__.py (ruff I001)
- Use snake_case field names in LegacyEvaluationCriteria constructor
  (expectedOutput/expectedAgentBehavior → expected_output/expected_agent_behavior)
- Remove now-unused type: ignore[override] in base_legacy_evaluator.py
- Fix return type of _create_legacy_evaluator_internal to LegacyEvaluator
  (was BaseLegacyEvaluator[Any] which mypy couldn't verify for the union)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ruff format check was failing CI for uipath and uipath-eval packages.
Applied auto-format to 4 source files and 2 test files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
uipath now depends on uipath-eval, so the llamaindex test workflow
needs to build and install the uipath-eval wheel before resolving
uipath's dependencies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@radu-mocanu radu-mocanu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should keep the same namespace as before uipath.eval instead of uipath_eval (check uipath-platform or uipath-core). This lets all uipath-* distributions share the uipath.* top-level namespace seamlessly. we avoid breaking changes and the refactor is not disruptive.

@rakesh-uipath
Copy link
Copy Markdown
Author

Thanks @radu-mocanu — this is a valid architectural concern. Looked into it: the complication is that uipath/eval/ in the main uipath package has a superset of files vs uipath-eval (LLM evaluators, legacy evaluators, factory, attachment utils, etc.). If both packages contribute to the uipath/eval/ namespace, we'd need to split ownership cleanly with no __init__.py conflicts at any shared subdirectory.

Two options:

Option A (safe for this PR): keep uipath_eval namespace, add a forward-compatibility note that a future namespace migration PR will move to uipath.eval in a dedicated refactor.

Option B (proper fix): this PR scopes down to just the core evaluators, and the uipath package removes its re-export shim entirely. The LLM/legacy evaluators stay in uipath under uipath.eval while uipath-eval owns the base types — which means they coexist as namespace packages with no __init__.py at the uipath/eval/ level.

@Chibionos — can you weigh in? Option B is cleaner but adds scope to this PR.

@rakesh-uipath
Copy link
Copy Markdown
Author

thanks Cosmin — you're right, should follow the uipath-platform pattern. So uipath-eval should live at src/uipath/eval/ (namespace package, no init.py at uipath/ level), and uipath main fully removes its uipath/eval/ dir + depends on uipath-eval. The coupled runtime files (runtime.py, _evaluate.py, etc.) that import uipath.runtime would move into uipath-eval too, with uipath as an implicit dep (no circular since uipath declares uipath-eval, not the other way). Will push the rename + structure change today.

Copy link
Copy Markdown
Author

@rakesh-uipath rakesh-uipath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call on the namespace — agreed, using uipath.eval is cleaner and follows the same pattern as uipath-core / uipath-platform.

Here's the plan to address this:

  1. Move uipath-eval/src/uipath_eval/uipath-eval/src/uipath/eval/
  2. Update pyproject.toml: packages = ["src/uipath_eval"]packages = ["src/uipath"]
  3. Update all internal imports uipath_eval.uipath.eval.
  4. Remove the main uipath package's src/uipath/eval/ entirely (same as how uipath-core fully owns uipath.core) — uipath-eval becomes the canonical owner of the uipath.eval namespace

One thing I want to flag before pushing: the runtime integration files (eval/runtime/runtime.py, _evaluate.py, etc.) depend on uipath.runtime from the main package. Moving them into uipath-eval creates a circular dep (uipathuipath-evaluipath). Planning to use lazy imports in those files so they degrade gracefully when uipath isn't installed (standalone use case). Will push the changes shortly — want to make sure the approach looks right to you first.

@radu-mocanu
Copy link
Copy Markdown
Collaborator

Moving them into uipath-eval creates a circular dep (uipath → uipath-eval → uipath). Planning to use lazy imports in those files so they degrade gracefully when uipath isn't installed (standalone use case)

@rakesh-uipath there is no real cycle. Importing uipath.runtime does not import the uipath facade (uipath.runtime lives in a separate package https://github.com/UiPath/uipath-runtime-python). you just need to add uipath-runtime as a dependency here.

tldr; we have an acyclic dependency graph: uipath → uipath-eval → uipath-runtime → uipath-core.
Lazy imports are the wrong tool regardless as they turn install-time errors into prod ImportErros on the first code path that hits the missing module.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:uipath-langchain Triggers tests in the uipath-langchain-python repository test:uipath-llamaindex Triggers tests in the uipath-llamaindex-python repository

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants